Fundamentals of Data Analysis - Spambase¶

Rosario Scavo (1000037803)¶

The dataset can be downloaded from here: http://archive.ics.uci.edu/dataset/94/spambase

Table of Contents:¶

  • Dataset description
    • Attribute description
  • Dataset Analysis
    • Dataset integrity
    • Descriptive statistics
      • Histogram distributions
      • Word frequencies
      • Feature ratios
      • Hypothesis testing (chi-square test) on features
    • Outlier Analysis
      • Word frequencies
      • Character frequencies
      • Capital Run frequencies
      • Interquartile Range (IQR) Analysis
    • Multicollinearity
  • Classification Algorithms
    • Logistic Regression
      • Multicollinearity in Logistic Regression
      • Reducing predictors
    • Support Vector Machine
      • Grid Search and Cross Validation
      • Impact of Data Normalization
    • Decision Tree
      • Grid Search and Cross Validation
      • Random Forest
    • K-Nearest Neighbors
      • Grid Search and Cross Validation
      • Impact of Data Normalization
  • Conclusion

Dataset description ¶

The dataset includes various types of content that fall under the category of "spam", such as advertisements, chain letters, make-money-fast schemes, and pornography. The spam emails were collected from the postmaster and from individuals who had reported spam; the non-spam emails were collected from personal and work files, where the presence of the word 'george' and the area code '650' served as indicators of non-spam.

The central goal is to establish a classification rule to identify spam messages based on the frequency of specific words, numbers, characters, or consecutive capital letters in phrases. We will utilize various classification algorithms, including logistic regression (LR), Support Vector Machine (SVM), Decision-Tree, Random Forest and K-nearest neighbors algorithm (KNN), to achieve this. These algorithms will be optimized through appropriate data preparation, transformation, and hyperparameter tuning using built-in Python functions. Additionally, we will determine the appropriate metrics to maximize and their impact on classification performance.

However, effective implementation requires thorough data analysis. Without prior data understanding, employing classifiers becomes challenging, if not impossible. This analysis will involve attribute exploration, variable type verification, missing value identification, feature-level metric analysis (mean, standard deviation, quantiles, etc.), feature importance determination for spam/non-spam classification, and outlier detection and analysis.

In [ ]:
# imports
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from scipy.stats import chi2_contingency
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from statsmodels.formula.api import logit
from sklearn import metrics
from sklearn.metrics import classification_report

import graphviz
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier

import warnings
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter('ignore', ConvergenceWarning)
warnings.simplefilter('ignore', RuntimeWarning)
In [ ]:
names_list_filepath = 'spambase/names.txt'
attribute_names = []

with open(names_list_filepath, 'r') as file:
    attribute_names = file.read().splitlines()

data = pd.read_csv('spambase/spambase.data', names=attribute_names)
data
Out[ ]:
word_freq_make word_freq_address word_freq_all word_freq_3d word_freq_our word_freq_over word_freq_remove word_freq_internet word_freq_order word_freq_mail ... char_freq_; char_freq_( char_freq_[ char_freq_! char_freq_$ char_freq_# capital_run_length_average capital_run_length_longest capital_run_length_total Class
0 0.00 0.64 0.64 0.0 0.32 0.00 0.00 0.00 0.00 0.00 ... 0.000 0.000 0.0 0.778 0.000 0.000 3.756 61 278 1
1 0.21 0.28 0.50 0.0 0.14 0.28 0.21 0.07 0.00 0.94 ... 0.000 0.132 0.0 0.372 0.180 0.048 5.114 101 1028 1
2 0.06 0.00 0.71 0.0 1.23 0.19 0.19 0.12 0.64 0.25 ... 0.010 0.143 0.0 0.276 0.184 0.010 9.821 485 2259 1
3 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 0.63 ... 0.000 0.137 0.0 0.137 0.000 0.000 3.537 40 191 1
4 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 0.63 ... 0.000 0.135 0.0 0.135 0.000 0.000 3.537 40 191 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4596 0.31 0.00 0.62 0.0 0.00 0.31 0.00 0.00 0.00 0.00 ... 0.000 0.232 0.0 0.000 0.000 0.000 1.142 3 88 0
4597 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.000 0.000 0.0 0.353 0.000 0.000 1.555 4 14 0
4598 0.30 0.00 0.30 0.0 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.102 0.718 0.0 0.000 0.000 0.000 1.404 6 118 0
4599 0.96 0.00 0.00 0.0 0.32 0.00 0.00 0.00 0.00 0.00 ... 0.000 0.057 0.0 0.000 0.000 0.000 1.147 5 78 0
4600 0.00 0.00 0.65 0.0 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.000 0.000 0.0 0.125 0.000 0.000 1.250 5 40 0

4601 rows × 58 columns

Attribute description ¶

  • The last column of 'spambase.data' (Class) indicates whether the email was considered spam (1) or not (0), i.e., unsolicited commercial email.
  • Most attributes indicate whether a specific word or character frequently occurs in the email.
  • Attributes 55-57 (run-length attributes) measure the length of sequences of consecutive capital letters.

Definitions of Attributes:¶

  1. 48 continuous real [0,100] attributes of type word_freq_WORD:

    • Percentage of words in the email that match the specified word.
    • Calculation: $\frac{100 \times (\text{Number of times the word appears in the email})}{\text{Total number of words in the email}}$
  2. 6 continuous real [0,100] attributes of type char_freq_CHAR:

    • Percentage of characters in the email that match the specified character.
    • Calculation: $\frac{100 \times (\text{Number of occurrences of the character})}{\text{Total number of characters in the email}}$
  3. 1 continuous real [1,...] attribute of type capital_run_length_average:

    • Average length of uninterrupted sequences of capital letters.
  4. 1 continuous integer [1,...] attribute of type capital_run_length_longest:

    • Length of the longest uninterrupted sequence of capital letters.
  5. 1 continuous integer [1,...] attribute of type capital_run_length_total:

    • Sum of the length of uninterrupted sequences of capital letters.
    • Total number of capital letters in the email.
  6. 1 nominal {0,1} class attribute of type spam:

    • Denotes whether the email was considered spam (1) or not (0), i.e., unsolicited commercial email.
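To make the word_freq_WORD definition concrete, such a frequency could be computed as in the sketch below. This helper is purely illustrative; the dataset's original extraction code may tokenize differently (e.g. with respect to punctuation and numbers).

```python
def word_freq(text: str, word: str) -> float:
    """Percentage of whitespace-separated tokens in `text` equal to `word` (case-insensitive)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return 100 * tokens.count(word.lower()) / len(tokens)

# 2 of the 6 tokens are "money" -> 100 * 2 / 6
print(round(word_freq("Make money fast! Make MONEY now", "money"), 2))  # 33.33
```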
In [ ]:
data.keys()
Out[ ]:
Index(['word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d',
       'word_freq_our', 'word_freq_over', 'word_freq_remove',
       'word_freq_internet', 'word_freq_order', 'word_freq_mail',
       'word_freq_receive', 'word_freq_will', 'word_freq_people',
       'word_freq_report', 'word_freq_addresses', 'word_freq_free',
       'word_freq_business', 'word_freq_email', 'word_freq_you',
       'word_freq_credit', 'word_freq_your', 'word_freq_font', 'word_freq_000',
       'word_freq_money', 'word_freq_hp', 'word_freq_hpl', 'word_freq_george',
       'word_freq_650', 'word_freq_lab', 'word_freq_labs', 'word_freq_telnet',
       'word_freq_857', 'word_freq_data', 'word_freq_415', 'word_freq_85',
       'word_freq_technology', 'word_freq_1999', 'word_freq_parts',
       'word_freq_pm', 'word_freq_direct', 'word_freq_cs', 'word_freq_meeting',
       'word_freq_original', 'word_freq_project', 'word_freq_re',
       'word_freq_edu', 'word_freq_table', 'word_freq_conference',
       'char_freq_;', 'char_freq_(', 'char_freq_[', 'char_freq_!',
       'char_freq_$', 'char_freq_#', 'capital_run_length_average',
       'capital_run_length_longest', 'capital_run_length_total', 'Class'],
      dtype='object')
  • Number of instances: 4601, of which 1813 are SPAM (39.4%)
  • Number of attributes: 58 (57 continuous, 1 categorical representing the class label).
In [ ]:
class_counts = data['Class'].value_counts()
print(class_counts)
print("\n")
data.info()
Class
0    2788
1    1813
Name: count, dtype: int64


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 58 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   word_freq_make              4601 non-null   float64
 1   word_freq_address           4601 non-null   float64
 2   word_freq_all               4601 non-null   float64
 3   word_freq_3d                4601 non-null   float64
 4   word_freq_our               4601 non-null   float64
 5   word_freq_over              4601 non-null   float64
 6   word_freq_remove            4601 non-null   float64
 7   word_freq_internet          4601 non-null   float64
 8   word_freq_order             4601 non-null   float64
 9   word_freq_mail              4601 non-null   float64
 10  word_freq_receive           4601 non-null   float64
 11  word_freq_will              4601 non-null   float64
 12  word_freq_people            4601 non-null   float64
 13  word_freq_report            4601 non-null   float64
 14  word_freq_addresses         4601 non-null   float64
 15  word_freq_free              4601 non-null   float64
 16  word_freq_business          4601 non-null   float64
 17  word_freq_email             4601 non-null   float64
 18  word_freq_you               4601 non-null   float64
 19  word_freq_credit            4601 non-null   float64
 20  word_freq_your              4601 non-null   float64
 21  word_freq_font              4601 non-null   float64
 22  word_freq_000               4601 non-null   float64
 23  word_freq_money             4601 non-null   float64
 24  word_freq_hp                4601 non-null   float64
 25  word_freq_hpl               4601 non-null   float64
 26  word_freq_george            4601 non-null   float64
 27  word_freq_650               4601 non-null   float64
 28  word_freq_lab               4601 non-null   float64
 29  word_freq_labs              4601 non-null   float64
 30  word_freq_telnet            4601 non-null   float64
 31  word_freq_857               4601 non-null   float64
 32  word_freq_data              4601 non-null   float64
 33  word_freq_415               4601 non-null   float64
 34  word_freq_85                4601 non-null   float64
 35  word_freq_technology        4601 non-null   float64
 36  word_freq_1999              4601 non-null   float64
 37  word_freq_parts             4601 non-null   float64
 38  word_freq_pm                4601 non-null   float64
 39  word_freq_direct            4601 non-null   float64
 40  word_freq_cs                4601 non-null   float64
 41  word_freq_meeting           4601 non-null   float64
 42  word_freq_original          4601 non-null   float64
 43  word_freq_project           4601 non-null   float64
 44  word_freq_re                4601 non-null   float64
 45  word_freq_edu               4601 non-null   float64
 46  word_freq_table             4601 non-null   float64
 47  word_freq_conference        4601 non-null   float64
 48  char_freq_;                 4601 non-null   float64
 49  char_freq_(                 4601 non-null   float64
 50  char_freq_[                 4601 non-null   float64
 51  char_freq_!                 4601 non-null   float64
 52  char_freq_$                 4601 non-null   float64
 53  char_freq_#                 4601 non-null   float64
 54  capital_run_length_average  4601 non-null   float64
 55  capital_run_length_longest  4601 non-null   int64  
 56  capital_run_length_total    4601 non-null   int64  
 57  Class                       4601 non-null   int64  
dtypes: float64(55), int64(3)
memory usage: 2.0 MB

Dataset Analysis ¶

Dataset integrity ¶

Before analyzing the data, let's verify that the 'Class' attribute only contains the values 1 and 0. Additionally, we will check for any NaN values in the dataset.

In [ ]:
data['Class'].unique()
Out[ ]:
array([1, 0])
In [ ]:
count_nan_in_df = data.isnull().sum().sum()
print(f'Number of NaN values: {count_nan_in_df}')
Number of NaN values: 0

For simplicity, we convert the class to bool and rename the column to 'spam'. Consequently, when a record has spam=True, the email is spam.

In [ ]:
data['spam'] = data['Class'].astype(bool)
data = data.drop(columns=['Class'])
data['spam']
Out[ ]:
0        True
1        True
2        True
3        True
4        True
        ...  
4596    False
4597    False
4598    False
4599    False
4600    False
Name: spam, Length: 4601, dtype: bool

Using the min and max rows of the describe function, which report the minimum and maximum value of each column, we can confirm that the frequency attributes respect their documented ranges. The lower bound of 0 is respected, while the upper bound is 100 rather than 1 because the frequencies are multiplied by 100 (percentages), as explained earlier.

Issue: Matrix Sparsity¶

However, a notable observation is that for most frequency attributes the 25th and 50th percentiles (and often even the 75th) are zero. This stems from the inherent sparsity of the matrix: the frequency-related values are zero in the majority of records. The data is therefore concentrated near zero, introducing noise that can compromise the statistical analysis of the dataset.

To address this, at a later stage of the project we replace values equal to 0.0 with NaN for the attributes indicating frequencies. This mitigates the impact of matrix sparsity and makes the dataset more amenable to robust statistical analysis.
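As a toy illustration of that replacement (the column names mirror the dataset, but the values are made up), `DataFrame.replace` turns the zeros into NaN, so that summary statistics are then computed over the non-zero entries only:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'word_freq_make': [0.00, 0.21, 0.00],
                    'word_freq_free': [0.32, 0.00, 1.10]})
toy_nan = toy.replace(0.0, np.nan)

# describe()/count() now skip the NaNs
print(toy_nan['word_freq_make'].count())  # 1 non-zero value remains
```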

In [ ]:
data.describe()
Out[ ]:
word_freq_make word_freq_address word_freq_all word_freq_3d word_freq_our word_freq_over word_freq_remove word_freq_internet word_freq_order word_freq_mail ... word_freq_conference char_freq_; char_freq_( char_freq_[ char_freq_! char_freq_$ char_freq_# capital_run_length_average capital_run_length_longest capital_run_length_total
count 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 ... 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000
mean 0.104553 0.213015 0.280656 0.065425 0.312223 0.095901 0.114208 0.105295 0.090067 0.239413 ... 0.031869 0.038575 0.139030 0.016976 0.269071 0.075811 0.044238 5.191515 52.172789 283.289285
std 0.305358 1.290575 0.504143 1.395151 0.672513 0.273824 0.391441 0.401071 0.278616 0.644755 ... 0.285735 0.243471 0.270355 0.109394 0.815672 0.245882 0.429342 31.729449 194.891310 606.347851
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.588000 6.000000 35.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.065000 0.000000 0.000000 0.000000 0.000000 2.276000 15.000000 95.000000
75% 0.000000 0.000000 0.420000 0.000000 0.380000 0.000000 0.000000 0.000000 0.000000 0.160000 ... 0.000000 0.000000 0.188000 0.000000 0.315000 0.052000 0.000000 3.706000 43.000000 266.000000
max 4.540000 14.280000 5.100000 42.810000 10.000000 5.880000 7.270000 11.110000 5.260000 18.180000 ... 10.000000 4.385000 9.752000 4.081000 32.478000 6.003000 19.829000 1102.500000 9989.000000 15841.000000

8 rows × 57 columns

In [ ]:
data[data['spam'] == True].iloc[:, 0:-4].max()
Out[ ]:
word_freq_make           4.540
word_freq_address        4.760
word_freq_all            3.700
word_freq_3d            42.810
word_freq_our            7.690
word_freq_over           2.540
word_freq_remove         7.270
word_freq_internet      11.110
word_freq_order          3.330
word_freq_mail           7.550
word_freq_receive        2.610
word_freq_will           6.250
word_freq_people         5.550
word_freq_report         4.760
word_freq_addresses      4.410
word_freq_free          16.660
word_freq_business       7.140
word_freq_email          9.090
word_freq_you           12.500
word_freq_credit        18.180
word_freq_your          11.110
word_freq_font          17.100
word_freq_000            5.450
word_freq_money         12.500
word_freq_hp             3.580
word_freq_hpl            1.770
word_freq_george         1.280
word_freq_650            9.090
word_freq_lab            0.470
word_freq_labs           3.380
word_freq_telnet         1.360
word_freq_857            0.470
word_freq_data           2.120
word_freq_415            1.350
word_freq_85             1.910
word_freq_technology     1.620
word_freq_1999           5.050
word_freq_parts          1.560
word_freq_pm             1.880
word_freq_direct         2.220
word_freq_cs             0.100
word_freq_meeting        0.450
word_freq_original       0.890
word_freq_project        1.160
word_freq_re             5.550
word_freq_edu            2.730
word_freq_table          0.460
word_freq_conference     0.770
char_freq_;              1.117
char_freq_(              9.752
char_freq_[              1.171
char_freq_!              7.843
char_freq_$              6.003
char_freq_#             19.829
dtype: float64
In [ ]:
# convert the percentage frequencies to fractions in [0, 1]
data.iloc[:, :-4] /= 100
data.describe()
Out[ ]:
word_freq_make word_freq_address word_freq_all word_freq_3d word_freq_our word_freq_over word_freq_remove word_freq_internet word_freq_order word_freq_mail ... word_freq_conference char_freq_; char_freq_( char_freq_[ char_freq_! char_freq_$ char_freq_# capital_run_length_average capital_run_length_longest capital_run_length_total
count 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 ... 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000
mean 0.001046 0.002130 0.002807 0.000654 0.003122 0.000959 0.001142 0.001053 0.000901 0.002394 ... 0.000319 0.000386 0.001390 0.000170 0.002691 0.000758 0.000442 5.191515 52.172789 283.289285
std 0.003054 0.012906 0.005041 0.013952 0.006725 0.002738 0.003914 0.004011 0.002786 0.006448 ... 0.002857 0.002435 0.002704 0.001094 0.008157 0.002459 0.004293 31.729449 194.891310 606.347851
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.588000 6.000000 35.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000650 0.000000 0.000000 0.000000 0.000000 2.276000 15.000000 95.000000
75% 0.000000 0.000000 0.004200 0.000000 0.003800 0.000000 0.000000 0.000000 0.000000 0.001600 ... 0.000000 0.000000 0.001880 0.000000 0.003150 0.000520 0.000000 3.706000 43.000000 266.000000
max 0.045400 0.142800 0.051000 0.428100 0.100000 0.058800 0.072700 0.111100 0.052600 0.181800 ... 0.100000 0.043850 0.097520 0.040810 0.324780 0.060030 0.198290 1102.500000 9989.000000 15841.000000

8 rows × 57 columns

Descriptive statistics ¶

Emails can be categorized into two groups: spam and non-spam. To better understand these categories, it is important to calculate fundamental statistics for each group. Furthermore, we aim to pinpoint specific characteristics that could significantly influence the classification of an email.

In [ ]:
spam = data[data['spam'] == True]
non_spam = data[data['spam'] == False]
In [ ]:
spam.describe()
Out[ ]:
word_freq_make word_freq_address word_freq_all word_freq_3d word_freq_our word_freq_over word_freq_remove word_freq_internet word_freq_order word_freq_mail ... word_freq_conference char_freq_; char_freq_( char_freq_[ char_freq_! char_freq_$ char_freq_# capital_run_length_average capital_run_length_longest capital_run_length_total
count 1813.000000 1813.000000 1813.000000 1813.000000 1813.000000 1813.000000 1813.000000 1813.000000 1813.000000 1813.000000 ... 1813.000000 1813.000000 1813.000000 1813.000000 1813.000000 1813.000000 1813.000000 1813.000000 1813.000000 1813.000000
mean 0.001523 0.001646 0.004038 0.001647 0.005140 0.001749 0.002754 0.002081 0.001701 0.003505 ... 0.000021 0.000206 0.001090 0.000082 0.005137 0.001745 0.000789 9.519165 104.393271 470.619415
std 0.003106 0.003489 0.004807 0.022191 0.007072 0.003219 0.005721 0.005449 0.003548 0.006314 ... 0.000268 0.000916 0.002821 0.000474 0.007442 0.003605 0.006119 49.846186 299.284969 825.081179
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 2.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000940 0.000000 0.000000 2.324000 15.000000 93.000000
50% 0.000000 0.000000 0.003000 0.000000 0.002900 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000650 0.000000 0.003310 0.000800 0.000000 3.621000 38.000000 194.000000
75% 0.001700 0.002100 0.006400 0.000000 0.007800 0.002400 0.003400 0.001900 0.001900 0.005100 ... 0.000000 0.000000 0.001440 0.000000 0.006450 0.002110 0.000180 5.708000 84.000000 530.000000
max 0.045400 0.047600 0.037000 0.428100 0.076900 0.025400 0.072700 0.111100 0.033300 0.075500 ... 0.007700 0.011170 0.097520 0.011710 0.078430 0.060030 0.198290 1102.500000 9989.000000 15841.000000

8 rows × 57 columns

In [ ]:
non_spam.describe()
Out[ ]:
word_freq_make word_freq_address word_freq_all word_freq_3d word_freq_our word_freq_over word_freq_remove word_freq_internet word_freq_order word_freq_mail ... word_freq_conference char_freq_; char_freq_( char_freq_[ char_freq_! char_freq_$ char_freq_# capital_run_length_average capital_run_length_longest capital_run_length_total
count 2788.000000 2788.000000 2788.000000 2788.000000 2788.000000 2788.000000 2788.000000 2788.000000 2788.000000 2788.000000 ... 2788.000000 2788.000000 2788.000000 2788.000000 2788.000000 2788.000000 2788.000000 2788.000000 2788.000000 2788.000000
mean 0.000735 0.002445 0.002006 0.000009 0.001810 0.000445 0.000094 0.000384 0.000380 0.001672 ... 0.000512 0.000503 0.001586 0.000227 0.001100 0.000116 0.000217 2.377301 18.214491 161.470947
std 0.002978 0.016332 0.005030 0.000213 0.006145 0.002229 0.001105 0.002472 0.001985 0.006432 ... 0.003652 0.003034 0.002606 0.001349 0.008209 0.000696 0.002439 5.113685 39.084792 355.738403
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.384000 4.000000 18.750000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000645 0.000000 0.000000 0.000000 0.000000 1.857000 10.000000 54.000000
75% 0.000000 0.000000 0.001200 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.002220 0.000000 0.000270 0.000000 0.000000 2.555000 18.000000 141.000000
max 0.043400 0.142800 0.051000 0.008700 0.100000 0.058800 0.030700 0.058800 0.052600 0.181800 ... 0.100000 0.043850 0.052770 0.040810 0.324780 0.020380 0.074070 251.000000 1488.000000 5902.000000

8 rows × 57 columns

Histogram distributions ¶

Histograms visually represent the distribution of values within each feature, revealing the patterns associated with spam and non-spam emails. By comparing these histograms we can spot differences between the two distributions and thus identify the features that distinguish spam from legitimate messages. For instance, comparing word_freq_business with word_freq_3d shows that the latter is a good feature for discriminating between spam and non-spam.

In [ ]:
def plot_histogram(feature, spam, non_spam):
    plt.hist(spam[feature], bins=20, alpha=0.5, label='Spam')
    plt.hist(non_spam[feature], bins=20, alpha=0.5, label='Non-Spam')
    plt.xlabel(feature)
    plt.ylabel('Frequency')
    plt.title(f'Histogram of {feature} for Spam and Non-Spam Emails')
    plt.legend()
    plt.show()
In [ ]:
plot_histogram('word_freq_business', spam, non_spam)
[Figure: histograms of word_freq_business for spam and non-spam emails]
In [ ]:
plot_histogram('word_freq_3d', spam, non_spam)
[Figure: histograms of word_freq_3d for spam and non-spam emails]

Word frequencies ¶

Certain columns showcase markedly higher maximum values within one class, in contrast with relatively lower values in the counterpart class. These observations provide valuable insights into potential discriminative features crucial for email classification.

To identify features that influence email classification, we average the word-frequency values per class and plot them.

In [ ]:
mean_wf = data.groupby('spam').mean()
mean_wr_fr = mean_wf.iloc[:, 0:-9]
nospam_wr_fr = mean_wr_fr.iloc[0]
spam_wr_fr = mean_wr_fr.iloc[1]

The graph below juxtaposes the average word-frequency values in spam (orange) and non-spam (blue) emails. Certain words such as "3d" (as shown previously) and "you" exhibit higher average frequencies in spam emails, while others such as "hp," "address," "font," and "george" are more prevalent in non-spam emails. This suggests that the frequency of specific words plays a key role in email classification.

Following a similar approach, we extend the analysis to the frequencies of special characters; non-spam emails tend to display a significant presence of such characters.

This comparative analysis highlights the distinctive word and character frequency patterns of spam and non-spam emails, contributing to a better understanding of the classification dynamics.

In [ ]:
plt.figure(figsize=(16, 9))
plt.bar(nospam_wr_fr.index, nospam_wr_fr.values, width=1, alpha=0.8)
plt.bar(spam_wr_fr.index, spam_wr_fr.values, width=1, alpha=0.8)
plt.xticks(rotation='vertical')
plt.legend(['non_spam', 'spam'])
plt.grid()
plt.show()
[Figure: mean word frequencies for spam vs non-spam emails]

Feature ratios ¶

To select influential features for email classification, we average each feature within the spam and non-spam groups and compute the ratio of the spam mean to the non-spam mean. We show only the features whose ratio exceeds the average ratio.

In [ ]:
spam_mean = spam.mean()
non_spam_mean = non_spam.mean()
spam_diff = pd.concat(
    [spam_mean, non_spam_mean, spam_mean/non_spam_mean], axis=1)
# remove last row (spam column)
spam_diff = spam_diff[:-1]
spam_diff.columns = ['Spam', 'Non-Spam', 'Ratio']

spam_diff.sort_values(by='Ratio', ascending=False, inplace=True)
In [ ]:
spam_diff_mean = spam_diff['Ratio'].mean()
selected_spam_diff = spam_diff[spam_diff['Ratio'] > spam_diff_mean]
selected_spam_diff
Out[ ]:
Spam Non-Spam Ratio
word_freq_3d 0.001647 0.000009 185.872477
word_freq_000 0.002471 0.000071 34.857704
word_freq_remove 0.002754 0.000094 29.351310
word_freq_credit 0.002055 0.000076 27.117520
char_freq_$ 0.001745 0.000116 14.978608
word_freq_addresses 0.001121 0.000083 13.474663
word_freq_money 0.002129 0.000171 12.421667

We can plot the ratios for all features to get a fuller picture.

In [ ]:
spam_diff['Ratio'].plot(kind='bar', figsize=(10, 6))
plt.xlabel('Features')
plt.ylabel('Ratio')
plt.title('Spam vs Non-Spam Ratio Comparison')
plt.show()
[Figure: spam/non-spam mean ratio for each feature]
In [ ]:
spam_indicators = list(selected_spam_diff.index.values)
spam_indicators.append('spam')
spam_indicators
Out[ ]:
['word_freq_3d',
 'word_freq_000',
 'word_freq_remove',
 'word_freq_credit',
 'char_freq_$',
 'word_freq_addresses',
 'word_freq_money',
 'spam']

Upon closer examination of certain word pairs, a clear trend emerges: the joint appearance of both words in an email suggests a higher likelihood of it being spam. Moreover, higher frequencies of these words correspond to a higher likelihood of the email being classified as spam.

In [ ]:
# suppress warnings for this cell only
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    pair_spam = sns.pairplot(data[spam_indicators].iloc[::-1], hue="spam")
    pair_spam.fig.suptitle('SPAM indicators', y=1.01, fontsize=20)

Hypothesis testing (chi-square test) on features ¶

The p-values obtained through the chi-square test serve as crucial indicators in understanding the relationship between the examined feature (independent variable) and the target variable 'spam.' The null hypothesis, in this context, posits no association or difference between the feature and the likelihood of an email being classified as spam.

Interpretation Guidelines:¶
  • Small p-value (e.g., < 0.05):

    • Conclusion: Reject the null hypothesis.
    • Implication: Strong evidence exists, suggesting an association or difference between the feature and the 'spam' variable. The feature is likely to be statistically significant in predicting spam.
  • Large p-value (e.g., > 0.05):

    • Conclusion: Fail to reject the null hypothesis.
    • Implication: Insufficient evidence to conclude an association or difference between the feature and the 'spam' variable. The feature may not be statistically significant in predicting spam.

A commonly used significance level (alpha) is 0.05. If a p-value is less than or equal to alpha, the null hypothesis is rejected. Careful consideration of these p-values allows the identification of features that play a significant role in predicting spam.

To calculate the p-values we apply the chi-square test:

The Chi-Square test is a statistical method used to determine if there is a significant association between two categorical variables. It compares the observed distribution of categorical data with the distribution that would be expected if the variables were independent. The test yields a p-value, indicating the probability of obtaining the observed distribution by chance.

Formula:

The Chi-Square test statistic (χ²) is calculated using the formula:

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} $$

where:

  • $O_i$ is the observed frequency in each category,
  • $E_i$ is the expected frequency in each category assuming independence.

The test compares the sum of squared differences between observed and expected frequencies, normalized by the expected frequencies. A higher Chi-Square value suggests a greater difference between observed and expected values, and a lower p-value indicates stronger evidence against the null hypothesis of independence.
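To make the formula concrete, here is a small hand check on a made-up 2×2 contingency table, comparing the manual statistic with scipy's chi2_contingency (passing correction=False so that no Yates continuity correction is applied and the two values match):

```python
import numpy as np
from scipy.stats import chi2_contingency

# made-up counts: rows = feature present/absent, cols = non-spam/spam
observed = np.array([[30, 10],
                     [20, 40]])

# expected counts under independence: row_total * col_total / grand_total
expected = (observed.sum(axis=1, keepdims=True)
            * observed.sum(axis=0, keepdims=True)) / observed.sum()

chi2_manual = ((observed - expected) ** 2 / expected).sum()
chi2_scipy, p_value, _, _ = chi2_contingency(observed, correction=False)
print(round(chi2_manual, 4), round(chi2_scipy, 4))  # both ≈ 16.6667
```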

In [ ]:
p_values = {}
for column in data.columns[:-1]:
    contingency_table = pd.crosstab(data[column], data['spam'])
    _, p_value, _, _ = chi2_contingency(contingency_table)
    p_values[column] = round(p_value, 5)
In [ ]:
pd.crosstab(data['capital_run_length_total'], data['spam'])
Out[ ]:
spam False True
capital_run_length_total
1 9 0
2 8 5
3 31 1
4 46 1
5 114 1
... ... ...
9088 0 1
9090 0 1
9163 0 1
10062 0 1
15841 0 1

919 rows × 2 columns

In [ ]:
sorted_p_values = dict(
    sorted(p_values.items(), key=lambda item: float(item[1]), reverse=True))
keys = sorted_p_values.keys()
values = [float(v) for v in sorted_p_values.values()]

plt.figure(figsize=(10, 5))
plt.bar(keys, values)
plt.xticks(rotation=90)
plt.xlabel('Features')
plt.ylabel('P-values')
plt.title('P-values for each feature')
plt.show()
No description has been provided for this image

Features with Higher P-Values:

  • word_freq_cs (0.73532)
  • word_freq_table (0.28752)
  • word_freq_conference (0.25999)
  • word_freq_project (0.23742)
  • word_freq_parts (0.22839)
  • word_freq_data (0.13572)
  • char_freq_[ (0.11276)
  • word_freq_meeting (0.08921)

These results suggest that the listed features may be less discriminative in identifying spam emails than others. Notably, features selected earlier as spam indicators, such as 'word_freq_3d', 'word_freq_remove', 'word_freq_addresses', 'word_freq_credit', 'word_freq_000', 'word_freq_money', and 'char_freq_$', are absent from this list: their low p-values support their potential as strong indicators of spam in the Spambase dataset.

In [ ]:
p_value_threshold = 0.05
non_significant_indicators = {
    k: v for k, v in sorted_p_values.items() if v > p_value_threshold}
non_significant_indicators
Out[ ]:
{'word_freq_cs': 0.73532,
 'word_freq_table': 0.28752,
 'word_freq_conference': 0.25999,
 'word_freq_project': 0.23742,
 'word_freq_parts': 0.22839,
 'word_freq_data': 0.13572,
 'char_freq_[': 0.11276,
 'word_freq_meeting': 0.08921}
In [ ]:
spam_indicators = spam_indicators[:-1]
spam_indicators
Out[ ]:
['word_freq_3d',
 'word_freq_000',
 'word_freq_remove',
 'word_freq_credit',
 'char_freq_$',
 'word_freq_addresses',
 'word_freq_money']

Outlier Analysis ¶

In our exploration of the Spambase dataset, we aimed to identify outliers and understand their impact on the data. Initially, we plotted a boxplot for all the word frequency features, disregarding the distinction between spam and non-spam emails. This allowed us to observe the overall distribution of the data and identify potential outliers.

In [ ]:
data_wr_fr = data.iloc[:, :-10]
data_char_freq = data.iloc[:, -10:-4]
data_capital_run = data.iloc[:, -4:-1]
In [ ]:
def draw_boxplot(ax, label, data):
    ax.boxplot(data,
               vert=True,
               patch_artist=True,
               labels=data.columns)
    ax.set_title(label)
    ax.yaxis.grid(True)
    ax.tick_params(labelrotation=90)

fig, ax = plt.subplots(figsize=(20, 5))
draw_boxplot(ax, 'Boxplots Word Frequencies', data_wr_fr)
plt.show()
No description has been provided for this image

Word frequencies ¶

To gain a deeper understanding of outliers within each class, we opted to create separate boxplots for spam and non-spam emails. This more nuanced approach allows us to discern specific characteristics within each category and better comprehend the distinctions in outliers between spam and non-spam instances.

  • Symmetry Differences:

    • The symmetry of the same feature differs between spam and non-spam classes. For instance, consider word_freq_george, a feature used to label non-spam emails. The asymmetry suggests that this feature may not exhibit similar behavior across both classes.
  • Distinctive Spam Features:

    • Certain features, such as word_freq_3d and word_freq_credit, clearly stand out as potential indicators for classifying spam emails. These features demonstrate notable differences between spam and non-spam distributions, as previously stated.
  • Potential Non-Spam Indicators:

    • In examining boxplots for non-spam emails, features like word_freq_hp, word_freq_lab, and word_freq_meeting emerge as potential indicators. Notably, non-spam distributions seem to harbor more outliers, suggesting potential discriminative power in these features.
  • Common Minimal Value:

    • It's important to note that each boxplot has a minimum value of 0, reflecting the inherent nature of the features, which are always positive.

These insights derived from the boxplots contribute to a nuanced understanding of feature behaviors within spam and non-spam categories, aiding in the identification of key indicators for effective email classification.

In [ ]:
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(20, 9))
draw_boxplot(ax1, 'Boxplots Word Frequencies for SPAM emails', data_wr_fr[data['spam']==True])
draw_boxplot(ax2, 'Boxplots Word Frequencies for NON-SPAM emails', data_wr_fr[data['spam']==False])
fig.subplots_adjust(hspace=0.8)
plt.show()
No description has been provided for this image

Character frequencies ¶

Observing the outlier distributions concerning the frequency of certain characters, notable distinctions emerge:

  1. char_freq_$:

    • A noticeable difference in distribution is observed between spam and non-spam emails for char_freq_$. This discrepancy aligns with the common practice in spam emails, where fraudulent messages often involve mentions of free offerings (0$) or repetitive use of symbols like $$$.
  2. char_freq_;:

    • Another significant feature is char_freq_;, which correlates with non-spam emails beyond a certain frequency threshold. This is plausible, as the presence of ; is often associated with well-organized text (unlike this report :/)
General Observation on Outliers:¶

In summary, outliers in our dataset carry informative value and contribute to the classification process. Notably, features like char_freq_$ and char_freq_; showcase distinctive patterns between spam and non-spam emails. Consequently, the decision has been made not to remove any outliers from our data.

In [ ]:
char_freq_cols = data_char_freq.columns
fig, axes = plt.subplots(1, len(char_freq_cols), figsize=(18, 6))

for i, col in enumerate(char_freq_cols):
    data.boxplot(by='spam', column=col, ax=axes[i])
    axes[i].set_title(col)
    
fig.suptitle('Comparison of Character Frequencies')
plt.tight_layout()
plt.show()
No description has been provided for this image

Capital Run frequencies ¶

We also investigated the role of capital letters in distinguishing between spam and non-spam emails. Notably, the feature capital_run_length_average caught our interest, as it represents the average length of consecutive sequences of capital letters in an email. This metric proved to be a valuable indicator, showcasing a higher average presence of consecutive capital letters in spam emails compared to non-spam counterparts.

Upon visualizing the data, we observed that spam emails exhibit a tendency towards longer consecutive sequences of capital letters, suggesting a potential pattern that could aid in classification.

In [ ]:
capital_run_cols = data_capital_run.columns
fig, axes = plt.subplots(1, len(capital_run_cols), figsize=(18, 6))

for i, col in enumerate(capital_run_cols):
    data.boxplot(by='spam', column=col, ax=axes[i])
    axes[i].set_title(col)
    
fig.suptitle('Comparison of Capital Run Frequencies')
plt.tight_layout()
plt.show()
No description has been provided for this image

Interquartile Range (IQR) Analysis ¶

The Interquartile Range (IQR) serves as a crucial measure of statistical dispersion, representing the range between the first quartile (Q1) and the third quartile (Q3) within a dataset. It provides insights into the spread of the middle 50% of the data.

Interpreting the Values:¶

  • Each column in iqr_df corresponds to a feature from the dataset.
  • A larger IQR suggests a greater variability in the middle 50% of the data for a specific feature.
  • A small IQR indicates that the central portion of the data is concentrated in a narrow range.
  • By comparing IQR values between "spam" and "non_spam," we can identify features where the spread of data significantly differs for the two categories.

Example Interpretation:¶

For instance, if iqr_df indicates that the IQR for feature X is substantially larger in the "spam" category compared to the "non_spam" category, it suggests that the spread of values for feature X is more diverse among spam instances.

This IQR analysis provides valuable insights into the distributional differences within numerical features, aiding in the identification of characteristics that may contribute to the classification of spam and non-spam instances.

In [ ]:
columns = data.iloc[:, :-1].columns
spam_iqr = []
non_spam_iqr = []
for col in columns:
    spam_q1,spam_q3 = data[data['spam']==True][col].quantile([1/4,3/4])
    non_spam_q1, non_spam_q3 = data[data['spam']==False][col].quantile([1/4,3/4])
    spam_iqr.append(spam_q3-spam_q1)
    non_spam_iqr.append(non_spam_q3-non_spam_q1)

iqr_df = pd.DataFrame([spam_iqr, non_spam_iqr], columns=columns, index=["spam", "non_spam"])
iqr_df
Out[ ]:
word_freq_make word_freq_address word_freq_all word_freq_3d word_freq_our word_freq_over word_freq_remove word_freq_internet word_freq_order word_freq_mail ... word_freq_conference char_freq_; char_freq_( char_freq_[ char_freq_! char_freq_$ char_freq_# capital_run_length_average capital_run_length_longest capital_run_length_total
spam 0.0017 0.0021 0.0064 0.0 0.0078 0.0024 0.0034 0.0019 0.0019 0.0051 ... 0.0 0.0 0.00144 0.0 0.00551 0.00211 0.00018 3.384 69.0 437.00
non_spam 0.0000 0.0000 0.0012 0.0 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 ... 0.0 0.0 0.00222 0.0 0.00027 0.00000 0.00000 1.171 14.0 122.25

2 rows × 57 columns

The observed values in our analysis further confirm the earlier assertions regarding features that appear to be particularly informative in discerning between spam and non-spam emails. Notably, features such as capital_run_length_total, word_freq_free, and others exhibit noticeable distinctions in their Interquartile Range (IQR) values when stratified by the spam and non-spam classes.

The IQR, a measure of statistical dispersion, provides insight into the data spread within each class. The discernible differences in IQR values between spam and non-spam instances for specific features suggest that these variables carry significant discriminatory potential. For instance, capital_run_length_total indicates a variance in the total length of consecutive capital letters, while word_freq_free reflects the frequency of the word "free" in the email.

These findings reinforce the hypothesis that certain features possess inherent patterns or characteristics that contribute significantly to classifying emails into spam or non-spam categories.

In [ ]:
iqr_df.loc['abs_diff'] = abs(iqr_df.loc["non_spam"] - iqr_df.loc["spam"])
iqr_df_transposed = iqr_df.T
iqr_df_transposed[iqr_df_transposed['abs_diff']>0].sort_values(by='abs_diff', ascending=False)
Out[ ]:
spam non_spam abs_diff
capital_run_length_total 437.00000 122.250000 314.750000
capital_run_length_longest 69.00000 14.000000 55.000000
capital_run_length_average 3.38400 1.171000 2.213000
word_freq_your 0.01490 0.004600 0.010300
word_freq_hp 0.00000 0.010000 0.010000
word_freq_our 0.00780 0.000000 0.007800
word_freq_free 0.00640 0.000000 0.006400
char_freq_! 0.00551 0.000270 0.005240
word_freq_all 0.00640 0.001200 0.005200
word_freq_mail 0.00510 0.000000 0.005100
word_freq_email 0.00390 0.000000 0.003900
word_freq_remove 0.00340 0.000000 0.003400
word_freq_business 0.00340 0.000000 0.003400
word_freq_000 0.00340 0.000000 0.003400
word_freq_hpl 0.00000 0.003300 0.003300
word_freq_money 0.00290 0.000000 0.002900
word_freq_re 0.00050 0.003125 0.002625
word_freq_over 0.00240 0.000000 0.002400
char_freq_$ 0.00211 0.000000 0.002110
word_freq_address 0.00210 0.000000 0.002100
word_freq_order 0.00190 0.000000 0.001900
word_freq_internet 0.00190 0.000000 0.001900
word_freq_make 0.00170 0.000000 0.001700
word_freq_people 0.00170 0.000000 0.001700
word_freq_george 0.00000 0.001625 0.001625
word_freq_receive 0.00140 0.000000 0.001400
word_freq_1999 0.00000 0.001000 0.001000
word_freq_will 0.00840 0.007525 0.000875
char_freq_( 0.00144 0.002220 0.000780
word_freq_you 0.02050 0.019925 0.000575
char_freq_# 0.00018 0.000000 0.000180

Multicollinearity ¶

Before moving on to the classification methods, we should address a potential issue in our dataset: multicollinearity. Multicollinearity occurs when independent variables in a multiple regression model display high correlations among themselves. This correlation between independent variables (our features) makes it difficult to distinguish the individual effects of these features on the dependent variable (the class of the email, spam or non-spam). In such cases, we can remove the correlated variables (identified via the correlation matrix, by plotting a heatmap), apply a feature selection method, or perform Principal Component Analysis (PCA).
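As an aside, one further diagnostic for multicollinearity, not used in this analysis but sketched here for illustration, is the Variance Inflation Factor (VIF). The synthetic data and the rule-of-thumb threshold of 10 are assumptions for the example:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic example: x3 is almost a copy of x1, so x1 and x3 should show
# a high VIF (rule of thumb: VIF > 10 signals problematic collinearity).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + rng.normal(scale=0.05, size=200)  # nearly collinear with x1

# VIF expects a design matrix that includes the intercept column
X = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3, 'const': 1.0})
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(3)],
    index=['x1', 'x2', 'x3'])
print(vif.round(1))  # x1 and x3 very large, x2 close to 1
```

Each VIF is 1/(1-R²) from regressing that feature on all the others, so a large value means the feature is largely explained by the rest.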

Analyzing Correlations in the Spambase Dataset¶

In this analysis of the Spambase dataset, we aim to uncover potential relationships and dependencies between different parameters by utilizing a correlation matrix. A correlation matrix is a tabular representation of correlation coefficients between variables in a dataset. The correlation coefficient quantifies the strength and direction of a linear relationship between two variables. In this specific investigation, we choose to employ the Kendall correlation coefficient as opposed to Pearson, considering the presence of outliers in the dataset. The Kendall correlation coefficient is particularly robust in scenarios with outliers and non-normally distributed data. It measures the strength of dependence between two variables by comparing the number of concordant and discordant pairs of observations. The formula for calculating the Kendall correlation coefficient, denoted as $\tau$, is as follows:

$$\tau = \frac{{\text{{Number of concordant pairs}} - \text{{Number of discordant pairs}}}}{{\text{{Total number of pairs}}}}$$

Here, concordant pairs are those with the same order of ranks in both variables, while discordant pairs have different orderings. By employing the Kendall correlation coefficient, we aim to gain insights into potential associations among the parameters in the Spambase dataset while accounting for its unique characteristics, including the presence of outliers.
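As a minimal sketch of the formula, consider two toy rankings in which exactly one of the six pairs is discordant:

```python
from scipy.stats import kendalltau

# Toy rankings: of the 6 possible pairs, 5 are concordant and 1 is
# discordant (the last two items are swapped), so tau = (5 - 1) / 6.
x = [1, 2, 3, 4]
y = [1, 2, 4, 3]

tau, p_value = kendalltau(x, y)
print(round(tau, 4))  # 0.6667
```

With no ties, scipy's default tau-b coincides with the plain formula above.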

In [ ]:
import seaborn as sns

new_df = data.iloc[:, :-1].copy()

plt.rcParams.update({'figure.figsize':(60,55), 'figure.dpi':100})

correlation_matrix = new_df.corr(method='kendall')
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", vmin=-1, vmax=1, cbar=True, cmap='coolwarm', annot_kws={'size': 15})
plt.show()
No description has been provided for this image
In [ ]:
threshold = 0.7
high_corr = correlation_matrix[abs(correlation_matrix) > threshold]
np.fill_diagonal(high_corr.values, np.nan)
mask = np.triu(np.ones_like(high_corr, dtype=bool))
inverse_mask = ~mask

high_corr_masked = high_corr * inverse_mask
high_corr_masked.dropna(how='all', axis=1, inplace=True)
high_corr_masked.dropna(how='all', axis=0, inplace=True)

mask = np.triu(np.ones_like(high_corr_masked, dtype=bool))

plt.rcParams.update({'figure.figsize':(15,11), 'figure.dpi':100})

sns.heatmap(high_corr_masked, mask=mask, annot=True, fmt=".2f", vmin=-1, vmax=1, cbar=True, cmap='coolwarm', annot_kws={'size': 15})
plt.show()
No description has been provided for this image

We can decide to remove

  • word_freq_hpl since it is correlated with word_freq_hp
  • word_freq_telnet since it is correlated with word_freq_857 and word_freq_415
  • word_freq_857 since it is correlated with word_freq_415
  • word_freq_85 since it is correlated with word_freq_650
  • capital_run_length_longest since it is correlated with capital_run_length_average
In [ ]:
high_corr_attributes = ['word_freq_hpl', 'word_freq_telnet', 'word_freq_857', 'word_freq_85', 'capital_run_length_longest']

Classification Algorithms ¶

Logistic Regression ¶

In this analysis, we utilize the logistic regression model from the statsmodels library to classify the emails.

To assess the performance of our logistic regression model, we consider several key statistical metrics:

  1. Pseudo R-squared: This is an important metric in the context of logistic regression, serving as an analogue of the R-squared used in linear regression. It indicates the explanatory power of the model, i.e. how well the logistic model performs compared to a baseline model that predicts the outcome using no features. It is a tool for model comparison rather than an absolute measure of fit. The value of Pseudo R-squared lies between 0 and 1, with values closer to 1 indicating stronger explanatory power.

  2. LLR (Log-Likelihood Ratio) p-value: This metric tests the null hypothesis that all coefficients are zero (i.e., the model is no better than an intercept-only model). A small LLR p-value suggests that our model is statistically significant in distinguishing between spam and non-spam emails.

  3. P>|z| for the parameters: This indicates the probability of observing a z-statistic as extreme as the one computed under the null hypothesis that a particular coefficient is zero. Smaller values suggest that the corresponding feature plays a significant role in predicting whether an email is spam.

By analyzing these parameters, we can understand not only the performance of our model but also the importance of different features in the spam detection task.
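As a sketch of how the first two metrics relate to the log-likelihoods that statsmodels reports (the numbers below are copied from the Logit summary shown later in this section, not recomputed from the data):

```python
from scipy.stats import chi2

# McFadden's pseudo R-squared and the LLR test from reported log-likelihoods
# (taken from the summary below: LL = -645.69, LL-Null = -2310.5, df = 52).
ll_model, ll_null, df_model = -645.69, -2310.5, 52

pseudo_r2 = 1 - ll_model / ll_null            # 1 - LL_model / LL_null
llr_stat = 2 * (ll_model - ll_null)           # likelihood-ratio statistic
llr_p_value = chi2.sf(llr_stat, df_model)     # chi-square tail probability

print(round(pseudo_r2, 4))  # 0.7205, matching the summary
print(llr_p_value)          # effectively 0, matching "LLR p-value: 0.000"
```

The intercept-only model's log-likelihood (LL-Null) anchors both quantities: pseudo R-squared measures the relative improvement, while the LLR statistic tests whether that improvement could be due to chance.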

In [ ]:
names_list_filepath = 'spambase/names.txt'
attribute_names = []

with open(names_list_filepath, 'r') as file:
    attribute_names = file.read().splitlines()

data = pd.read_csv('spambase/spambase.data', names=attribute_names)

data.iloc[:, :-4] /= 100
In [ ]:
column_name_mapping = {'char_freq_;':'char_freq_semicolon',
                       'char_freq_(':'char_freq_round_bracket', 
                       'char_freq_[':'char_freq_square_bracket', 
                       'char_freq_!':'char_freq_exclamation',
                       'char_freq_#':'char_freq_hash',
                       'char_freq_$':'char_freq_dollar',
                       'Class':'spam'}

data.rename(columns=column_name_mapping, inplace=True)
data_attributes = data.columns.tolist()[:-1]

Multicollinearity in Logistic Regression ¶

In our analysis of the Spambase dataset using logistic regression, we encounter an issue of multicollinearity (as shown previously). When multicollinearity is present, it becomes difficult to isolate the individual effect of each predictor on the response variable. High multicollinearity among predictors inflates the standard errors of the regression coefficients, which in turn leads to wider confidence intervals and less reliable p-values (P>|z|) for the hypothesis tests.

  • Evidence of Multicollinearity - Singular Matrix Error:
    In our case, the application of logistic regression to the Spambase dataset using all features resulted in a LinAlgError: Singular matrix. This error is indicative of a perfect or near-perfect collinearity among some of the variables. It implies that the matrix of predictors cannot be inverted, which is a requisite for the regression analysis. This scenario often arises when the data includes redundant variables (predictors that are linear combinations of other predictors).
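A minimal numpy sketch (with made-up numbers) of why a perfect linear combination among predictors makes the relevant matrix non-invertible:

```python
import numpy as np

# Hypothetical design matrix in which the third column is the exact sum
# of the first two -- a perfect linear combination among predictors.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 3.0],
              [4.0, 0.0, 4.0],
              [3.0, 3.0, 6.0]])

xtx = X.T @ X
rank = np.linalg.matrix_rank(xtx)
print(rank)  # 2, not 3: X'X is rank-deficient, so it has no inverse,
             # which is what surfaces as "LinAlgError: Singular matrix"
```

Dropping the redundant column (here the third) restores full rank, which is exactly the strategy applied below by removing the highly correlated features.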
In [ ]:
email_train, email_test = train_test_split(data, test_size=0.25, random_state=0)

formula = "spam ~ " + " + ".join(data_attributes)

model = logit(formula, email_train).fit()
summary = model.summary()
summary
Warning: Maximum number of iterations has been exceeded.
         Current function value: inf
         Iterations: 35
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
/home/rosario/Documents/FAD/UCI-Spambase/uci_spambase.ipynb Cell 75 line 5
      1 email_train, email_test = train_test_split(data, test_size=0.25, random_state=0)
      3 formula = "spam ~ " + " + ".join(data_attributes)
----> 5 model = logit(formula, email_train).fit()
      6 summary = model.summary()
      7 summary

File ~/anaconda3/envs/fad/lib/python3.9/site-packages/statsmodels/discrete/discrete_model.py:2599, in Logit.fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
   2596 @Appender(DiscreteModel.fit.__doc__)
   2597 def fit(self, start_params=None, method='newton', maxiter=35,
   2598         full_output=1, disp=1, callback=None, **kwargs):
-> 2599     bnryfit = super().fit(start_params=start_params,
   2600                           method=method,
   2601                           maxiter=maxiter,
   2602                           full_output=full_output,
   2603                           disp=disp,
   2604                           callback=callback,
   2605                           **kwargs)
   2607     discretefit = LogitResults(self, bnryfit)
   2608     return BinaryResultsWrapper(discretefit)

File ~/anaconda3/envs/fad/lib/python3.9/site-packages/statsmodels/discrete/discrete_model.py:243, in DiscreteModel.fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
    240 else:
    241     pass  # TODO: make a function factory to have multiple call-backs
--> 243 mlefit = super().fit(start_params=start_params,
    244                      method=method,
    245                      maxiter=maxiter,
    246                      full_output=full_output,
    247                      disp=disp,
    248                      callback=callback,
    249                      **kwargs)
    251 return mlefit

File ~/anaconda3/envs/fad/lib/python3.9/site-packages/statsmodels/base/model.py:582, in LikelihoodModel.fit(self, start_params, method, maxiter, full_output, disp, fargs, callback, retall, skip_hessian, **kwargs)
    580     Hinv = cov_params_func(self, xopt, retvals)
    581 elif method == 'newton' and full_output:
--> 582     Hinv = np.linalg.inv(-retvals['Hessian']) / nobs
    583 elif not skip_hessian:
    584     H = -1 * self.hessian(xopt)

File ~/anaconda3/envs/fad/lib/python3.9/site-packages/numpy/linalg/linalg.py:561, in inv(a)
    559 signature = 'D->D' if isComplexType(t) else 'd->d'
    560 extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 561 ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
    562 return wrap(ainv.astype(result_t, copy=False))

File ~/anaconda3/envs/fad/lib/python3.9/site-packages/numpy/linalg/linalg.py:112, in _raise_linalgerror_singular(err, flag)
    111 def _raise_linalgerror_singular(err, flag):
--> 112     raise LinAlgError("Singular matrix")

LinAlgError: Singular matrix
Addressing Multicollinearity:¶

To resolve this issue, we can try to remove highly correlated features.

  • Variables Removed to Reduce Multicollinearity:
    We identified and removed the following variables due to their high correlation with other features in the dataset (they are the same features we found before):

    • word_freq_hpl
    • word_freq_telnet
    • word_freq_857
    • word_freq_85
    • capital_run_length_longest
  • Impact on the Model:
    After removing these features, the total number of features in our model was reduced to 52. This adjustment yielded a Pseudo R-squared of 0.7205, indicating a relatively strong explanatory power of the model with the reduced set of predictors.

  • Convergence Issue:
    Despite these adjustments, an important issue arose: the model did not converge after 35 iterations.

  • Next Steps:

    1. Simplifying the model by reducing the number of predictors.
    2. Adjusting the fitting algorithm, such as increasing the number of iterations or changing the convergence criteria.
In [ ]:
high_corr_attributes
data_attributes_no_corr = [attr for attr in data_attributes if attr not in high_corr_attributes]
In [ ]:
email_train, email_test = train_test_split(data, test_size=0.25, random_state=0)

formula = "spam ~ " + " + ".join(data_attributes_no_corr)

model = logit(formula, email_train).fit()
summary = model.summary()
summary
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.187157
         Iterations: 35
Out[ ]:
Logit Regression Results
Dep. Variable: spam No. Observations: 3450
Model: Logit Df Residuals: 3397
Method: MLE Df Model: 52
Date: Tue, 19 Dec 2023 Pseudo R-squ.: 0.7205
Time: 17:26:37 Log-Likelihood: -645.69
converged: False LL-Null: -2310.5
Covariance Type: nonrobust LLR p-value: 0.000
coef std err z P>|z| [0.025 0.975]
Intercept -1.8215 0.167 -10.923 0.000 -2.148 -1.495
word_freq_make -45.4364 27.856 -1.631 0.103 -100.033 9.160
word_freq_address -11.2960 8.086 -1.397 0.162 -27.143 4.551
word_freq_all 22.2041 14.407 1.541 0.123 -6.034 50.442
word_freq_3d 188.9055 141.728 1.333 0.183 -88.877 466.688
word_freq_our 74.6857 14.184 5.265 0.000 46.885 102.486
word_freq_over 87.0471 29.119 2.989 0.003 29.975 144.119
word_freq_remove 236.9795 39.456 6.006 0.000 159.647 314.312
word_freq_internet 45.5503 15.609 2.918 0.004 14.958 76.142
word_freq_order 76.2801 38.573 1.978 0.048 0.679 151.881
word_freq_mail 2.2032 7.645 0.288 0.773 -12.782 17.188
word_freq_receive 12.3526 37.611 0.328 0.743 -61.364 86.069
word_freq_will -12.7096 8.412 -1.511 0.131 -29.196 3.777
word_freq_people 7.2314 29.409 0.246 0.806 -50.409 64.871
word_freq_report 20.8883 16.436 1.271 0.204 -11.325 53.102
word_freq_addresses 86.8922 71.429 1.216 0.224 -53.107 226.891
word_freq_free 108.4484 17.778 6.100 0.000 73.605 143.292
word_freq_business 93.3197 25.837 3.612 0.000 42.681 143.959
word_freq_email 0.9827 13.957 0.070 0.944 -26.372 28.337
word_freq_you 11.2640 4.127 2.729 0.006 3.175 19.353
word_freq_credit 159.7084 89.855 1.777 0.076 -16.404 335.821
word_freq_your 29.1681 6.604 4.417 0.000 16.225 42.111
word_freq_font 15.2619 19.085 0.800 0.424 -22.143 52.667
word_freq_000 217.2654 52.468 4.141 0.000 114.430 320.101
word_freq_money 37.2177 15.865 2.346 0.019 6.123 68.312
word_freq_hp -268.1836 37.433 -7.164 0.000 -341.551 -194.816
word_freq_george -1946.4215 319.258 -6.097 0.000 -2572.155 -1320.688
word_freq_650 53.0517 28.887 1.836 0.066 -3.567 109.670
word_freq_lab -231.6616 155.425 -1.491 0.136 -536.288 72.965
word_freq_labs -58.6854 47.956 -1.224 0.221 -152.677 35.306
word_freq_data -125.9851 45.157 -2.790 0.005 -214.491 -37.479
word_freq_415 -1160.4886 408.508 -2.841 0.005 -1961.150 -359.827
word_freq_technology 120.5670 36.498 3.303 0.001 49.032 192.102
word_freq_1999 10.9384 26.862 0.407 0.684 -41.711 63.588
word_freq_parts 147.1036 120.475 1.221 0.222 -89.023 383.230
word_freq_pm -98.5601 47.632 -2.069 0.039 -191.918 -5.202
word_freq_direct -40.9160 40.089 -1.021 0.307 -119.488 37.656
word_freq_cs -5359.3845 5581.531 -0.960 0.337 -1.63e+04 5580.216
word_freq_meeting -308.1263 110.470 -2.789 0.005 -524.644 -91.608
word_freq_original -248.3528 133.466 -1.861 0.063 -509.942 13.237
word_freq_project -143.7917 61.447 -2.340 0.019 -264.225 -23.358
word_freq_re -80.0238 16.100 -4.970 0.000 -111.580 -48.468
word_freq_edu -185.0862 36.148 -5.120 0.000 -255.934 -114.238
word_freq_table -304.1867 260.651 -1.167 0.243 -815.053 206.680
word_freq_conference -478.9719 200.538 -2.388 0.017 -872.019 -85.925
char_freq_semicolon -142.5491 54.074 -2.636 0.008 -248.532 -36.566
char_freq_round_bracket -18.7805 31.375 -0.599 0.549 -80.275 42.714
char_freq_square_bracket -107.0447 141.624 -0.756 0.450 -384.623 170.534
char_freq_exclamation 22.0186 6.082 3.620 0.000 10.098 33.939
char_freq_dollar 559.6474 81.671 6.852 0.000 399.575 719.719
char_freq_hash 303.2639 123.727 2.451 0.014 60.763 545.765
capital_run_length_average 0.1026 0.020 5.011 0.000 0.062 0.143
capital_run_length_total 0.0015 0.000 6.146 0.000 0.001 0.002


Possibly complete quasi-separation: A fraction 0.29 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
In [ ]:
test_probs = model.predict(email_test.dropna()) 
test_preds = test_probs.round().astype(int)
test_gt = email_test.dropna()['spam']

from sklearn.metrics import classification_report
print("Classification Report")
print(classification_report(test_gt, test_preds))
Classification Report
              precision    recall  f1-score   support

           0       0.91      0.93      0.92       691
           1       0.90      0.86      0.88       460

    accuracy                           0.90      1151
   macro avg       0.90      0.90      0.90      1151
weighted avg       0.90      0.90      0.90      1151

Reducing Predictors ¶

To enhance our logistic regression model, we implemented a strategy of simplifying the model by reducing the number of predictors. This was guided by the statistical significance of each predictor, as indicated by their P>|z| values.

  • Criterion for Predictor Removal:
    We chose to remove all features with P>|z| values greater than 0.05. This decision is based on the principle that features with a P>|z| value above this threshold are not statistically significant at the 5% level, implying that their contribution to the model in distinguishing spam from non-spam emails might be negligible.

  • Variables Removed:
    Based on the above criterion, the following variables were removed from the model:

    • word_freq_email
    • word_freq_people
    • word_freq_mail
    • ...
  • Outcome of the Model Refinement:
    The removal of these variables resulted in a lighter model with only 28 parameters. Notably, the Pseudo R-squared value of this refined model is 0.7027, which is slightly lower than the previous value but still indicates substantial explanatory power.

  • Model Convergence:
    A significant improvement with this simplified model is its convergence. Unlike the previous versions, this model successfully converged in just 19 iterations. This faster convergence suggests that the reduced model is more stable and efficient in fitting the data.

  • Interpretation:
    The slightly lower Pseudo R-squared value suggests a marginal reduction in the model's explanatory power. However, this trade-off is offset by the benefits of a more parsimonious model, which typically offers better generalizability and interpretability. With fewer predictors, each remaining variable in the model is likely to be more meaningful and significant in distinguishing between spam and non-spam emails.

In [ ]:
p_values = model.pvalues
if 'Intercept' in p_values.index:
    p_values.drop('Intercept', inplace=True)
    
ordered_p_values = p_values.sort_values(ascending=False).round(3)

useful_p_values = ordered_p_values[ordered_p_values < 0.05]
useful_attributes = useful_p_values.index.tolist()

useful_attributes
Out[ ]:
['word_freq_order',
 'word_freq_pm',
 'word_freq_project',
 'word_freq_money',
 'word_freq_conference',
 'char_freq_hash',
 'char_freq_semicolon',
 'word_freq_you',
 'word_freq_meeting',
 'word_freq_data',
 'word_freq_415',
 'word_freq_internet',
 'word_freq_over',
 'word_freq_technology',
 'word_freq_business',
 'char_freq_exclamation',
 'word_freq_000',
 'word_freq_your',
 'word_freq_re',
 'capital_run_length_average',
 'word_freq_edu',
 'word_freq_our',
 'word_freq_remove',
 'word_freq_george',
 'word_freq_free',
 'capital_run_length_total',
 'char_freq_dollar',
 'word_freq_hp']
In [ ]:
email_train, email_test = train_test_split(data, test_size=0.25, random_state=0)

formula = "spam ~ " + " + ".join(useful_attributes)

model = logit(formula, email_train).fit()
summary = model.summary()
summary
Optimization terminated successfully.
         Current function value: 0.199105
         Iterations 19
Out[ ]:
Logit Regression Results
Dep. Variable: spam No. Observations: 3450
Model: Logit Df Residuals: 3421
Method: MLE Df Model: 28
Date: Tue, 19 Dec 2023 Pseudo R-squ.: 0.7027
Time: 14:56:17 Log-Likelihood: -686.91
converged: True LL-Null: -2310.5
Covariance Type: nonrobust LLR p-value: 0.000
coef std err z P>|z| [0.025 0.975]
Intercept -1.9379 0.143 -13.579 0.000 -2.218 -1.658
word_freq_order 82.0203 36.764 2.231 0.026 9.963 154.077
word_freq_pm -105.1509 45.252 -2.324 0.020 -193.843 -16.459
word_freq_project -151.8314 62.519 -2.429 0.015 -274.366 -29.297
word_freq_money 46.2219 18.188 2.541 0.011 10.573 81.870
word_freq_conference -650.7634 226.439 -2.874 0.004 -1094.576 -206.950
char_freq_hash 375.1398 97.004 3.867 0.000 185.016 565.263
char_freq_semicolon -114.3148 35.136 -3.254 0.001 -183.179 -45.450
word_freq_you 11.5959 3.942 2.942 0.003 3.870 19.322
word_freq_meeting -308.4246 114.030 -2.705 0.007 -531.919 -84.931
word_freq_data -128.9851 42.758 -3.017 0.003 -212.790 -45.180
word_freq_415 -1290.3735 410.433 -3.144 0.002 -2094.807 -485.940
word_freq_internet 48.5591 14.959 3.246 0.001 19.239 77.879
word_freq_over 85.6357 28.570 2.997 0.003 29.640 141.631
word_freq_technology 128.9081 35.203 3.662 0.000 59.911 197.905
word_freq_business 101.1307 25.282 4.000 0.000 51.578 150.683
char_freq_exclamation 24.5126 6.684 3.667 0.000 11.411 37.614
word_freq_000 214.6662 51.363 4.179 0.000 113.996 315.337
word_freq_your 24.3360 6.019 4.043 0.000 12.539 36.133
word_freq_re -80.7811 15.837 -5.101 0.000 -111.822 -49.740
capital_run_length_average 0.1130 0.021 5.502 0.000 0.073 0.153
word_freq_edu -206.3152 37.021 -5.573 0.000 -278.875 -133.755
word_freq_our 79.6953 14.045 5.674 0.000 52.168 107.222
word_freq_remove 247.4560 39.337 6.291 0.000 170.356 324.556
word_freq_george -2096.4694 330.758 -6.338 0.000 -2744.743 -1448.196
word_freq_free 114.3631 17.211 6.645 0.000 80.630 148.096
capital_run_length_total 0.0015 0.000 6.823 0.000 0.001 0.002
char_freq_dollar 615.1742 84.351 7.293 0.000 449.849 780.499
word_freq_hp -275.2287 35.481 -7.757 0.000 -344.771 -205.686


Possibly complete quasi-separation: A fraction 0.27 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
Logistic regression results¶

The classification report provides insights into the performance of our logistic regression model:

  • Precision: We achieve a high precision score of 0.90 for both classes (0 and 1): 90% of the instances predicted as belonging to each class are in fact correct, indicating a low false-positive rate for each class.

  • Recall: For class 0, the recall score is 0.94, indicating that the model correctly identifies 94% of all actual non-spam instances. However, for class 1, the recall is slightly lower at 0.85, suggesting that 15% of actual spam instances were not captured by the model.

  • F1-Score: The F1-score, which is the harmonic mean of precision and recall, is 0.92 for non-spam and 0.88 for spam. This confirms the balanced classification capability of the model.

  • Accuracy: Overall, the model achieves an accuracy of 0.90, which remains consistent across the macro average and weighted average. This underscores the model's robustness in correctly classifying emails as either spam or not spam.
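
These metrics all derive from the confusion-matrix counts. The sketch below uses small hypothetical counts, purely to illustrate the formulas (not the actual counts from this model):

```python
# Hypothetical counts for one class: true positives, false positives, false negatives.
tp, fp, fn = 8, 2, 2

precision = tp / (tp + fp)   # fraction of predicted positives that are correct
recall = tp / (tp + fn)      # fraction of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
```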

In [ ]:
test_probs = model.predict(email_test.dropna()) 
test_preds = test_probs.round().astype(int)
test_gt = email_test.dropna()['spam']
In [ ]:
plt.rcParams.update({'figure.figsize':(8,8), 'figure.dpi':100})
conf_matrix = metrics.confusion_matrix(test_gt, test_preds)

cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = conf_matrix, display_labels = ['non-spam', 'spam'])

cm_display.plot()
plt.show()
[image: confusion matrix for the logistic regression test set]
In [ ]:
from sklearn.metrics import classification_report
print("Classification Report")
print(classification_report(test_gt, test_preds))
Classification Report
              precision    recall  f1-score   support

           0       0.90      0.94      0.92       691
           1       0.90      0.85      0.88       460

    accuracy                           0.90      1151
   macro avg       0.90      0.89      0.90      1151
weighted avg       0.90      0.90      0.90      1151

Support Vector Machine (SVM) ¶

SVM Training with Grid Search ¶

We conducted a comprehensive parameter optimization using Grid Search. This approach focused on varying the kernel parameter, testing three different types: linear, poly (polynomial), and rbf (radial basis function).

Kernel Types and Their Differences¶

Each kernel type represents a different approach to transforming the input data into a higher-dimensional space:

  • Linear Kernel: Simple and effective for linearly separable data, where a straight line can separate the classes.
  • Polynomial Kernel (poly): Suitable for non-linearly separable data, allowing the model to adapt to more complex relationships by raising the data to a specified power.
  • Radial Basis Function (rbf): Highly effective for non-linear data, as it can handle complex, multidimensional relationships by measuring the distance from a central point.
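
To make the difference concrete, here is a small sketch of how each kernel scores the similarity of two feature vectors; the gamma, degree, and coef0 values are arbitrary choices for illustration only:

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.0])

linear = x @ y                              # plain dot product
poly = (x @ y + 1) ** 2                     # polynomial kernel: degree=2, coef0=1
rbf = np.exp(-0.5 * np.sum((x - y) ** 2))   # RBF kernel with gamma=0.5

print(linear, poly, rbf)
```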

Best Model: Linear Kernel¶

Remarkably, the linear kernel emerged as the best model for the Spambase dataset, indicating that despite potential complexities and non-linearities, the data is predominantly linearly separable. This suggests that a simpler linear decision boundary was sufficient and more effective for this specific dataset, avoiding the overfitting or unnecessary complexity that might arise with higher-order kernels.

Cross-Validation Accuracy and Comparison with Logistic Regression¶

The accuracy obtained from cross-validation with the SVM model was 0.75, lower than what was achieved using logistic regression. This disparity could be attributed to several factors. Logistic regression can provide a more flexible probabilistic fit to the data than SVM, which seeks to maximize the margin between classes. Its effectiveness in this context may also be due to its simplicity and robustness, particularly on datasets that, while potentially complex, still exhibit a strong linear component in their feature relationships.

In [ ]:
data_useful_attributes = data[useful_attributes + ['spam']].copy()
X_data = data_useful_attributes.drop(columns=['spam'])  # Features
y_data = data_useful_attributes['spam']  # Target variable

X_train, X_test, y_train, y_test = train_test_split(X_data,y_data, test_size=0.25, random_state=0)
In [ ]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Define the parameter grid
param_grid = {'kernel': ['linear', 'poly', 'rbf']}


svm = SVC()
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy')

# Perform grid search
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

best_svm = grid_search.best_estimator_
Best Parameters: {'kernel': 'linear'}
Best Score: 0.7460869565217391
In [ ]:
# Evaluate the best model on the test set
y_pred = best_svm.predict(X_test)
print("\nClassification Report (Test Set):")
print(classification_report(y_test, y_pred))
Classification Report (Test Set):
              precision    recall  f1-score   support

           0       0.72      0.94      0.82       691
           1       0.83      0.46      0.59       460

    accuracy                           0.75      1151
   macro avg       0.78      0.70      0.71      1151
weighted avg       0.77      0.75      0.73      1151

Impact of Data Normalization ¶

The Support Vector Machine (SVM) classification algorithm exhibited a noteworthy increase in accuracy upon the application of data normalization. Initially, the SVM model produced an accuracy of 0.75. However, after normalizing the dataset, the accuracy improved dramatically to 0.91. This enhancement underscores the profound impact that feature scaling can have on the performance of SVM.

This happens because Support Vector Machines are fundamentally sensitive to the scale of the input features due to the way they are designed to maximize the margin between different classes. In the absence of normalization, features with larger scales can distort this margin, giving undue weight to certain variables and potentially misguiding the optimization process of the SVM. Normalization brings each feature onto the same scale, making the distance measure more consistent across different dimensions.
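
StandardScaler implements exactly this rescaling: each feature is shifted by its training-set mean and divided by its training-set standard deviation. A minimal standalone illustration on toy data (made up for this example):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature on a small scale, one on a large scale (toy values).
X = np.array([[0.1, 1000.0],
              [0.2, 3000.0],
              [0.3, 5000.0]])

X_scaled = StandardScaler().fit_transform(X)

# After scaling, every column has mean 0 and unit variance,
# so no single feature dominates the margin computation.
print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```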

In [ ]:
scaler = StandardScaler()

# Fit on training set only
scaler.fit(X_train)

# Apply transform to both the training set and the test set
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the SVM Classifier on the scaled data
svm_model = SVC(kernel='linear')  # You can change the kernel as needed
svm_model.fit(X_train_scaled, y_train)

# Predict on the scaled test data
y_pred = svm_model.predict(X_test_scaled)

print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.94      0.93       691
           1       0.91      0.87      0.89       460

    accuracy                           0.91      1151
   macro avg       0.91      0.91      0.91      1151
weighted avg       0.91      0.91      0.91      1151

Decision Tree ¶

A decision tree is a machine learning algorithm that partitions the data into subsets based on the value of input features. It is akin to a flowchart where each internal node represents a test on an attribute, each branch corresponds to an outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). This model is popular for its interpretability and ease of use.

In the context of the Spambase dataset, we have opted for a decision tree model due to its effectiveness in handling categorical and continuous data, and its capability to model complex decision boundaries. Decision trees can also inherently perform feature selection, which can be particularly advantageous given the high dimensionality of the Spambase dataset.

As part of our modeling process, we will experiment with different max_depth values, which determine the maximum length of the paths from the root to the leaves. This is crucial for controlling the complexity of the tree and preventing overfitting. Additionally, we will explore various pruning parameters to refine the tree structure, ensuring that it generalizes well to new data. By tuning these parameters, we aim to build an optimized decision tree model that effectively classifies emails as spam or not spam while maintaining interpretability.

In [ ]:
def plot_confusion_matrix_and_classification_report(y_test, y_test_predict, title=""):
    # Confusion Matrix
    conf_matrix = metrics.confusion_matrix(y_test, y_test_predict)
    cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=['Non-Spam', 'Spam'])

    # Classification Report
    class_report = classification_report(y_test, y_test_predict, output_dict=True)
    df_report = pd.DataFrame(class_report).transpose()

    # Plotting
    fig, ax = plt.subplots(1, 2, figsize=(16, 8))

    # Confusion Matrix
    cm_display.plot(ax=ax[0])
    ax[0].set_title('Confusion Matrix')
    

    # Classification Report Metrics
    df_report.iloc[:-3, :-1].plot(kind='bar', ax=ax[1])
    ax[1].set_title('Classification Report Metrics')
    ax[1].set_xticklabels(['Non-Spam', 'Spam'], rotation=0)
    
    fig.suptitle(title, fontsize=16)

    plt.tight_layout()
    plt.show()
In [ ]:
data_useful_attributes = data[useful_attributes + ['spam']].copy()
X_data = data_useful_attributes.drop(columns=['spam'])  # Features
y_data = data_useful_attributes['spam']  # Target variable
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X_data,y_data, test_size=0.25, random_state=0)
In [ ]:
dt = DecisionTreeClassifier(max_depth=3, random_state=0)
dt.fit(X_train, y_train)

y_test_predict = dt.predict(X_test)

plot_confusion_matrix_and_classification_report(y_test, y_test_predict, title="Decision Tree with max_depth=3")
[image: confusion matrix and classification-report metrics, decision tree with max_depth=3]
In [ ]:
print("Classification Report")
print(classification_report(y_test, y_test_predict))
Classification Report
              precision    recall  f1-score   support

           0       0.86      0.95      0.90       691
           1       0.91      0.76      0.83       460

    accuracy                           0.88      1151
   macro avg       0.89      0.86      0.87      1151
weighted avg       0.88      0.88      0.87      1151

In [ ]:
dot_data = export_graphviz(dt, out_file=None, feature_names=X_data.columns, class_names=['Non-Spam', 'Spam'], filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(dot_data)
graph
Out[ ]:
[image: rendered decision tree]

Grid Search and Cross Validation ¶

A DecisionTreeClassifier was trained with the max_depth parameter set to 3. This depth, representing the maximum length of a path from the root to a leaf, aims to prevent overfitting by limiting the complexity of the decision tree. With this setting, the classifier achieved an accuracy of 0.88, indicating a high level of predictive capability.

To further refine our model, we employ GridSearchCV, an exhaustive search over specified parameter values for an estimator. The parameters we are tuning are:

  1. max_depth: [5, 10, 15, 20] - These values represent various levels of tree depth. A deeper tree (higher max_depth) can model more complex patterns but risks overfitting.

  2. ccp_alpha: [0.0, 0.001, 0.01, 0.1] - Cost-Complexity Pruning (CCP) alpha is used to prune the tree to avoid overfitting. It's the complexity parameter used for Minimal Cost-Complexity Pruning: the subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen, so a higher value of ccp_alpha prunes more aggressively.
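
Scikit-learn also exposes the candidate alpha values directly via `cost_complexity_pruning_path`, which can guide the choice of grid. A brief sketch on a toy dataset (standing in for the Spambase features):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in data (NOT the Spambase features).
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

# The pruning path lists every "effective alpha" at which some subtree
# would be collapsed; larger alphas prune more aggressively.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_toy, y_toy)

print(path.ccp_alphas[:5])
```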

The GridSearchCV results are:

  • Best Parameters: {ccp_alpha: 0.001, max_depth: 15}. This implies that a tree depth of 15 and a slight pruning (CCP alpha of 0.001) yield the best trade-off between model complexity and generalization ability.

  • Best Accuracy Score: 0.8987. This score is an improvement over the initial model, underscoring the efficacy of parameter tuning.

  • Depth of Best Tree: 11. Interestingly, even though the best max_depth parameter was 15, the actual depth of the best-performing tree turned out to be 11, indicating that the optimal complexity for this dataset is achieved before reaching the maximum allowed depth.

In [ ]:
param_grid = {
    'max_depth': [5, 10, 15, 20],  # Different max_depth values to test
    'ccp_alpha': [0.0, 0.001, 0.01, 0.1]  # Different pruning parameters to test
}

clf = DecisionTreeClassifier(random_state=0)
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')


grid_search.fit(X_data, y_data)

print("Best Parameters:", grid_search.best_params_)
print("Best Accuracy Score:", grid_search.best_score_)
print("Depth of Best Tree:", grid_search.best_estimator_.get_depth())
Best Parameters: {'ccp_alpha': 0.001, 'max_depth': 15}
Best Accuracy Score: 0.8987135910871926
Depth of Best Tree: 11
In [ ]:
# Train the Decision Tree Classifier
dt = DecisionTreeClassifier(ccp_alpha=0.001, max_depth=15, random_state=0)
dt.fit(X_train, y_train)
y_test_predict = dt.predict(X_test)

plot_confusion_matrix_and_classification_report(y_test, y_test_predict, title="Decision Tree with max_depth=15 and ccp_alpha=0.001")


dt = DecisionTreeClassifier(max_depth=3, random_state=0)
dt.fit(X_train, y_train)
y_test_predict = dt.predict(X_test)
plot_confusion_matrix_and_classification_report(y_test, y_test_predict, title="Decision Tree with max_depth=3")
[image: confusion matrices and metrics for the pruned tree (max_depth=15, ccp_alpha=0.001) and the max_depth=3 tree]

Random Forest ¶

We also explored the Random Forest algorithm as an alternative approach. A Random Forest is an ensemble learning method that constructs multiple decision trees during training and, for classification, outputs the class that is the mode of the individual trees' predictions.
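
This mode-of-the-predictions aggregation can be sketched in a few lines; the votes below are made up for illustration (scikit-learn's implementation actually averages the trees' probability estimates, but the intuition is the same):

```python
import numpy as np

# Rows: trees, columns: samples. Each entry is one tree's predicted class.
votes = np.array([[0, 1, 1],
                  [0, 1, 0],
                  [1, 1, 0]])

# Majority vote per sample (column-wise mode).
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print(majority)  # [0 1 0]
```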

The Random Forest classifier was configured with n_estimators=100 (the default), signifying that 100 trees would be built in the forest, and max_depth=3, limiting the depth of each tree to prevent overfitting. The ensemble nature of Random Forest typically yields a more accurate model than a single decision tree, thanks to its ability to average out biases and reduce variance.

The mean accuracy over cross-validation for the Random Forest model was 0.9039.

This represents only a marginal improvement over the previously tested methods, suggesting that the Decision Tree model was already performing well once tuned (i.e., with optimal depth and pruning).

In [ ]:
rf_model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)

rf_model.fit(X_train, y_train)
y_test_predict = rf_model.predict(X_test)

plot_confusion_matrix_and_classification_report(y_test, y_test_predict, title="Random Forest with n_estimators=100 and max_depth=3")
[image: confusion matrix and classification-report metrics, random forest]
In [ ]:
scores = cross_val_score(rf_model, X_data, y_data, cv=5, scoring='accuracy')

print("Cross-Validation Scores:", scores)
print("Mean Accuracy:", scores.mean())
Cross-Validation Scores: [0.9218241  0.90652174 0.93043478 0.91413043 0.84673913]
Mean Accuracy: 0.9039300382382098
In [ ]:
print("\nClassification Report (Test Set):")
print(classification_report(y_test, y_test_predict))
Classification Report (Test Set):
              precision    recall  f1-score   support

           0       0.87      0.98      0.92       691
           1       0.96      0.78      0.86       460

    accuracy                           0.90      1151
   macro avg       0.92      0.88      0.89      1151
weighted avg       0.91      0.90      0.90      1151

K-Nearest Neighbors (K-NN) ¶

We also tried to employ the K-Nearest Neighbors (K-NN) algorithm, a well-known method in machine learning for its simplicity and effectiveness in classification tasks. Our approach was to fine-tune the model to determine the optimal number of neighbors (k) that would yield the best classification accuracy.

Grid Search for Hyperparameter Tuning ¶

To find the most suitable k value, we implemented a Grid Search strategy, varying k from 1 to 60. This exhaustive search allowed us to systematically traverse through a wide range of k values, aiming to pinpoint the one that maximizes the accuracy of our K-NN classifier.

Observations: Impact of Increasing k on Accuracy¶

One of the key observations from our analysis was the inverse relationship between the size of k and the accuracy of the model. Specifically, as k increased, there was a noticeable decline in accuracy. This trend can be attributed to the intrinsic workings of the K-NN algorithm. When k is small, the algorithm tends to capture the noise in the data, leading to overfitting. However, as k grows, the classifier starts to consider a broader set of neighbors for each query point. While this can reduce the impact of noise, it also increases the likelihood of including points from other classes within the neighborhood, consequently diluting the decision boundaries and diminishing the classifier's ability to distinguish accurately between classes.
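
This effect is easy to reproduce on a one-dimensional toy dataset (made up for illustration): a single mislabeled point sways the prediction at k=1 but is outvoted at k=5.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Class-0 cluster near 0, class-1 cluster near 2, plus one noisy
# class-1 point at 0.5 sitting inside the class-0 region.
X = np.array([[0.0], [0.1], [0.2], [0.5], [2.0], [2.1], [2.2]])
y = np.array([0, 0, 0, 1, 1, 1, 1])

query = np.array([[0.45]])

pred_k1 = KNeighborsClassifier(n_neighbors=1).fit(X, y).predict(query)
pred_k5 = KNeighborsClassifier(n_neighbors=5).fit(X, y).predict(query)

# k=1 follows the noisy point; k=5 smooths it out.
print(pred_k1[0], pred_k5[0])  # 1 0
```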

Optimal Model: k = 2¶

The Grid Search identified that the model achieved its peak performance with k = 2. This suggests that a tighter, more localized decision boundary is preferable for this particular dataset, as it helps to maintain a balance between reducing noise and preserving the integrity of the class boundaries.

Cross-Validation Results and Comparison with Other Methods¶

Despite identifying an optimal k, the mean accuracy achieved through cross-validation was only 0.6955. This performance is considerably lower when compared to other classification methods applied to the same dataset. For instance, using the Random Forest algorithm, we obtained a mean accuracy of 0.9039. The relatively lower efficiency of the K-NN model in this scenario can be primarily attributed to the characteristics of the Spambase dataset. Given its high dimensionality and potential noise, distance-based methods like K-NN face challenges.

In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X_data,y_data, test_size=0.25, random_state=0)

X_train = np.ascontiguousarray(X_train)
X_test = np.ascontiguousarray(X_test)
y_train = np.ascontiguousarray(y_train)
y_test = np.ascontiguousarray(y_test)
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Define a range of 'k' values for K-NN
k_range = list(range(1, 61))

# Create a K-NN classifier
knn = KNeighborsClassifier()

# Create a dictionary of all values we want to test for 'n_neighbors'
param_grid = dict(n_neighbors=k_range)

# Use grid search to test all values for 'n_neighbors'
grid = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')


grid.fit(X_train, y_train)
grid_results = grid.cv_results_

# Extract the mean test scores for each parameter
mean_test_scores = grid_results['mean_test_score']


plt.figure(figsize=(12, 6))
plt.plot(k_range, mean_test_scores, color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Accuracy vs. K Value')
plt.xlabel('K')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()


best_k = grid.best_params_['n_neighbors']
best_score = grid.best_score_

print("Best K Value:", best_k)
print("Best Score:", best_score)
[image: accuracy vs. K value]
Best K Value: 2
Best Score: 0.711304347826087
In [ ]:
X_data = np.ascontiguousarray(X_data)
y_data = np.ascontiguousarray(y_data)

# Use the best parameter to make predictions
knn_best = KNeighborsClassifier(n_neighbors=best_k)

scores = cross_val_score(knn_best, X_data, y_data, cv=5, scoring='accuracy')

print("Cross-Validation Scores:", scores)
print("Mean Accuracy:", scores.mean())
Cross-Validation Scores: [0.66340934 0.70108696 0.73478261 0.71630435 0.66195652]
Mean Accuracy: 0.6955079544918095
In [ ]:
X_train = np.ascontiguousarray(X_train)
X_test = np.ascontiguousarray(X_test)
y_train = np.ascontiguousarray(y_train)

knn_best.fit(X_train, y_train)
y_test_predict = knn_best.predict(X_test)

print("\nClassification Report (Test Set):")
print(classification_report(y_test, y_test_predict))
Classification Report (Test Set):
              precision    recall  f1-score   support

           0       0.71      0.91      0.79       691
           1       0.76      0.44      0.55       460

    accuracy                           0.72      1151
   macro avg       0.73      0.67      0.67      1151
weighted avg       0.73      0.72      0.70      1151

Impact of Data Normalization ¶

A critical aspect of our study involved the normalization of data prior to the application of the K-Nearest Neighbors (K-NN) algorithm. By standardizing the feature set, we observed a substantial improvement in the model's performance: the accuracy surged from 0.72 to an impressive 0.89. This significant enhancement in accuracy highlights the importance of normalization in the preprocessing phase, particularly for distance-based algorithms like K-NN.
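
A quick demonstration of why, on hypothetical feature values: when one feature lives on a far larger scale, the Euclidean distance K-NN relies on is driven almost entirely by that feature until the data are standardized.

```python
import numpy as np

# Two hypothetical emails differing slightly in a word frequency (0-1 scale)
# and substantially in a capital-run feature (much larger scale).
a = np.array([0.9, 100.0])
b = np.array([0.1, 250.0])

sq_diffs = (a - b) ** 2
share_large = sq_diffs[1] / sq_diffs.sum()

# The large-scale feature accounts for nearly all of the squared distance.
print(round(share_large, 6))
```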

In [ ]:
scaler = StandardScaler()

# Fit on training set only
scaler.fit(X_train)

# Apply transform to both the training set and the test set
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_best.fit(X_train_scaled, y_train)
y_test_predict = knn_best.predict(X_test_scaled)

print("\nClassification Report (Test Set):")
print(classification_report(y_test, y_test_predict))
Classification Report (Test Set):
              precision    recall  f1-score   support

           0       0.87      0.97      0.91       691
           1       0.94      0.78      0.85       460

    accuracy                           0.89      1151
   macro avg       0.90      0.87      0.88      1151
weighted avg       0.90      0.89      0.89      1151

Conclusion ¶

We evaluated a variety of classification algorithms, including Logistic Regression, Logistic Regression with Backward Feature Elimination (BFE), Support Vector Machine (SVM), SVM with Normalized Data, Decision Trees, Random Forest, K-Nearest Neighbors (K-NN), and K-NN with Normalized Data. The performance of these algorithms was compared based on three metrics: accuracy, macro average, and weighted average.

Our findings suggest that, overall, the classification algorithms exhibited similar performance. Notably, Logistic Regression (both with and without BFE), Random Forest, and Decision Trees demonstrated robust accuracy and consistency across the metrics without any need for data normalization. SVM and K-NN, by contrast, benefited significantly from normalization: their scores increased across all evaluation metrics once the features were standardized. This is because both algorithms rely on distance calculations, which are profoundly affected by the scale of the features, and it underscores the importance of preprocessing steps when utilizing scale-sensitive algorithms.

In summary, the comparative study has provided valuable insights into the strengths and limitations of each algorithm when applied to the Spambase dataset. The key takeaway is the critical role of data preprocessing and the selection of appropriate algorithms based on the data characteristics and the desired outcome of the model.

In [ ]:
data = {
    'Algorithm': ['Logistic Regression', 'Logistic Regression BFE', 'SVM', 'SVM Normalized', 'Decision Trees', 'Random Forest', 'K-NN', 'K-NN Normalized'],
    'accuracy': [0.90, 0.90, 0.75, 0.91, 0.88, 0.90, 0.72, 0.89],
    'macro avg': [0.90, 0.90, 0.71, 0.91, 0.87, 0.89, 0.67, 0.88],
    'weighted avg': [0.90, 0.90, 0.73, 0.91, 0.87, 0.90, 0.70, 0.89]
}

df = pd.DataFrame(data)
df = pd.melt(df, id_vars="Algorithm", var_name="Study Cases", value_name="Accuracy")
plt.figure(figsize=(12, 8))
g = sns.barplot(x='Algorithm', y='Accuracy', hue='Study Cases', data=df)
g.set_yticks(np.arange(0, 1.01, 0.05))
plt.title('Comparative Analysis of Classification Algorithms')
plt.xlabel('Algorithms')
plt.xticks(rotation=45)
plt.show()
[image: bar chart comparing the algorithms on accuracy, macro average, and weighted average]